vanilla sgd baseline
Comparison with the vanilla SGD baseline
We thank the reviewers for their comments. We will carefully modify the paper according to the suggestions.Figure 1: Comparison of different learning schemes on RotMNIST classification and IWSL T translation tasks. For the NMT tasks, we used the same parameter settings from previous papers, as described in section 5.2. Assistant model shows similar performance over different batch sizes. However, we will provide results on raw ImageNet dataset and large Transformer model in the revised version.